Get materials
The materials in this document are hosted on GitHub:
https://github.com/ontox-hu/course_R_AI/extra
To get them locally (in RStudio):
- Start a new project (File –> New Project)
- Choose Version Control –> Git
- Paste the above URL in the Repository URL field
- Click OK
- The materials will be cloned to your account or local computer
The rendered HTML document is also hosted on our RStudio::CONNECT server at:
Reference
https://github.com/rbind/simplystats
Free online book “r4ds”: “R for Data Science”, Hadley Wickham & Garrett Grolemund
Gapminder
https://www.gapminder.org/tools/#$state$time$value=2018;;&chart-type=bubbles
ggplot2
The ggplot2 package in R is the best plotting system for R. Its
syntax is an implementation of the ‘grammar of graphics’ and all plots
in this demo are created by using the {ggplot2} R package.
Citations
citation(package = "ggplot2")
##
## To cite ggplot2 in publications, please use:
##
## H. Wickham. ggplot2: Elegant Graphics for Data Analysis.
## Springer-Verlag New York, 2016.
##
## A BibTeX entry for LaTeX users is
##
## @Book{,
## author = {Hadley Wickham},
## title = {ggplot2: Elegant Graphics for Data Analysis},
## publisher = {Springer-Verlag New York},
## year = {2016},
## isbn = {978-3-319-24277-4},
## url = {https://ggplot2.tidyverse.org},
## }
citation(package = "tidyverse")
##
## To cite package 'tidyverse' in publications use:
##
## Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R,
## Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL,
## Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu
## V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019).
## "Welcome to the tidyverse." _Journal of Open Source Software_,
## *4*(43), 1686. doi:10.21105/joss.01686
## <https://doi.org/10.21105/joss.01686>.
##
## A BibTeX entry for LaTeX users is
##
## @Article{,
## title = {Welcome to the {tidyverse}},
## author = {Hadley Wickham and Mara Averick and Jennifer Bryan and Winston Chang and Lucy D'Agostino McGowan and Romain François and Garrett Grolemund and Alex Hayes and Lionel Henry and Jim Hester and Max Kuhn and Thomas Lin Pedersen and Evan Miller and Stephan Milton Bache and Kirill Müller and Jeroen Ooms and David Robinson and Dana Paige Seidel and Vitalie Spinu and Kohske Takahashi and Davis Vaughan and Claus Wilke and Kara Woo and Hiroaki Yutani},
## year = {2019},
## journal = {Journal of Open Source Software},
## volume = {4},
## number = {43},
## pages = {1686},
## doi = {10.21105/joss.01686},
## }
citation(package = "dslabs")
##
## To cite package 'dslabs' in publications use:
##
## Irizarry RA, Gill A (2021). _dslabs: Data Science Labs_. R package
## version 0.7.4, <https://CRAN.R-project.org/package=dslabs>.
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {dslabs: Data Science Labs},
## author = {Rafael A. Irizarry and Amy Gill},
## year = {2021},
## note = {R package version 0.7.4},
## url = {https://CRAN.R-project.org/package=dslabs},
## }
##
## ATTENTION: This citation information has been auto-generated from
## the package DESCRIPTION file and may need manual editing, see
## 'help("citation")'.
Installing data package
install.packages("dslabs")
Loading the data package {dslabs} and other packages used
library(tidyverse)
library(dslabs)
library(lubridate)
Datasets
Included in the {dslabs} package
data(package="dslabs")
Getting original data wrangling scripts
Included also in the {dslabs} package
list.files(system.file("script", package = "dslabs"))
## [1] "make-admissions.R"
## [2] "make-brca.R"
## [3] "make-brexit_polls.R"
## [4] "make-death_prob.R"
## [5] "make-divorce_margarine.R"
## [6] "make-gapminder-rdas.R"
## [7] "make-greenhouse_gases.R"
## [8] "make-historic_co2.R"
## [9] "make-mnist_27.R"
## [10] "make-movielens.R"
## [11] "make-murders-rda.R"
## [12] "make-na_example-rda.R"
## [13] "make-nyc_regents_scores.R"
## [14] "make-olive.R"
## [15] "make-outlier_example.R"
## [16] "make-polls_2008.R"
## [17] "make-polls_us_election_2016.R"
## [18] "make-reported_heights-rda.R"
## [19] "make-research_funding_rates.R"
## [20] "make-stars.R"
## [21] "make-temp_carbon.R"
## [22] "make-tissue-gene-expression.R"
## [23] "make-trump_tweets.R"
## [24] "make-weekly_us_contagious_diseases.R"
## [25] "save-gapminder-example-csv.R"
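To look inside one of these wrangling scripts without leaving R, something like the following sketch could be used. It assumes {dslabs} is installed; the file name is taken from the listing above and `script_path` is only an illustrative variable name.

```r
# Full path to one of the scripts shipped with {dslabs}
script_path <- system.file("script", "make-murders-rda.R",
                           package = "dslabs")

# Print the first ten lines of the script to the Console
head(readLines(script_path), 10)
```

In RStudio, `file.edit(script_path)` opens the same file in the editor.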
Demo dataset
data("gapminder", package = "dslabs")
## ?gapminder for more info on the variables in the dataset
The gapminder dataset contains a number of measurements on health and income outcomes for 184 countries from 1960 to 2016. It also includes two character vectors, OECD and OPEC, with the names of OECD and OPEC countries from 2016.
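The description above can be checked against the installed copy of the data. The sketch below assumes {dslabs} and {dplyr} are installed; `gapminder_check` is an illustrative name, and the country count may differ between package versions.

```r
library(dplyr)
library(dslabs)

data("gapminder", package = "dslabs")

# Number of distinct countries and the first and last year in the data
gapminder_check <- gapminder %>%
  summarise(n_countries = n_distinct(country),
            first_year  = min(year),
            last_year   = max(year))

gapminder_check
```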
Inspecting the gapminder dataset with R
gapminder <- gapminder %>% as_tibble()
gapminder %>% head(2)
## # A tibble: 2 × 9
## country year infant_mort…¹ life_…² ferti…³ popul…⁴ gdp conti…⁵ region
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct>
## 1 Albania 1960 115. 62.9 6.19 1.64e6 NA Europe South…
## 2 Algeria 1960 148. 47.5 7.65 1.11e7 1.38e10 Africa North…
## # … with abbreviated variable names ¹infant_mortality, ²life_expectancy,
## # ³fertility, ⁴population, ⁵continent
names(gapminder)
## [1] "country" "year" "infant_mortality"
## [4] "life_expectancy" "fertility" "population"
## [7] "gdp" "continent" "region"
A very simple example to start with
gapminder %>%
ggplot(aes(x = fertility,
y = life_expectancy)) +
geom_point()
This is a very dense plot
We call this ‘overplotting’.
This can be fixed in several ways:
- Increasing the transparency of data points (lowering alpha)
- Mapping colour to a variable (continuous or categorical)
- Reducing the data in the plot
- Mapping a shape to a variable
- Adding noise ("jitter") to points
- Facetting: create panels for ‘categorical’ or so-called ‘factor’ variables in R
- Summarizing the data
- Displaying a model / relationship that represents the data (and not showing the actual data itself)
- Or any combination of the above strategies
Basically, you map an aesthetic (aes()) to a variable
Let’s go over these methods for dealing with overplotting one by one
1. Overplotting: Increasing transparency (lowering alpha) of points or lines in the data
gapminder %>%
ggplot(aes(x = fertility,
y = life_expectancy)) +
geom_point(alpha = 0.1)
2. Mapping colour to a variable (continuous or categorical)
gapminder %>%
ggplot(aes(x = fertility,
y = life_expectancy)) +
geom_point(aes(colour = continent))
or combined with alpha
gapminder %>%
ggplot(aes(x = fertility,
y = life_expectancy)) +
geom_point(aes(colour = continent), alpha = 0.1) +
guides(colour = guide_legend(override.aes = list(alpha = 1)))
Do it yourself:
- Try adjusting some of the arguments in the previous ggplot2 call. For example, adjust the alpha = ... or change the variable in x = ..., y = ... or colour = ...
- names(gapminder) gives you the variable names that you can change
- Show and discuss the resulting plot with your neighbour
- What do you think this part does:
guides(colour = guide_legend(override.aes = list(alpha = 1)))
- Try to find out by disabling it with #
3. Reduce the data in the plot
reduce_data_plot <- gapminder %>%
dplyr::filter(continent == "Africa" | continent == "Europe") %>%
ggplot(aes(x = fertility,
y = life_expectancy)) +
geom_point(aes(colour = continent), alpha = 0.2) +
## override the alpha setting for the points in the legend:
guides(colour = guide_legend(override.aes = list(alpha = 1)))
Plot
reduce_data_plot
Discuss with your neighbour:
- What does the aes() part of the geom_point() do?
- Compare the code below with the code above. Can you spot the difference? What is the advantage of the code below?
reduce_data_plot <- gapminder %>%
filter(continent == "Africa" | continent == "Europe") %>%
ggplot(aes(x = fertility,
y = life_expectancy, colour = continent)) +
geom_point(alpha = 0.2) +
## override the alpha setting for the points in the legend:
guides(colour = guide_legend(override.aes = list(alpha = 1)))
4. Mapping a shape to a variable
## or e.g. show only two years and map a shape to continent
shape_plot <- gapminder %>%
dplyr::filter(continent == "Africa" | continent == "Europe",
year == "1960" | year == "2010") %>%
ggplot(aes(x = fertility,
y = life_expectancy)) +
geom_point(aes(colour = as_factor(as.character(year)),
shape = continent),
alpha = 0.7)
Do it yourself
- Try removing the as_factor(as.character(year)) call above, replace it by only year and rerun the plot. What happened?
Plot
shape_plot
5. Facetting
Create panels for ‘categorical’ or so-called ‘factor’ variables in R
facets_plot <- gapminder %>%
dplyr::filter(continent == "Africa" | continent == "Europe",
year == "1960" | year == "2010") %>%
ggplot(aes(x = fertility,
y = life_expectancy)) +
geom_point(aes(colour = continent), alpha = 0.5) +
facet_wrap(~ year)
Plot
facets_plot
6. Summarize the data
library(ggrepel)
years <- c("1960", "1970", "1980", "1990", "2000", "2010")
summarize_plot <- gapminder %>%
dplyr::filter(year %in% years) %>%
group_by(continent, year) %>%
summarise(mean_life_expectancy = mean(life_expectancy),
mean_fertility = mean(fertility)) %>%
ggplot(aes(x = mean_fertility,
y = mean_life_expectancy)) +
geom_point(aes(colour = continent), alpha = 0.7)
Plot
summarize_plot
Adding labels to the points with {ggrepel}
library(ggrepel)
years <- c("1960", "1970", "1980", "1990", "2000", "2010")
labels_plot <- gapminder %>%
dplyr::filter(year %in% years) %>%
group_by(continent, year) %>%
summarise(mean_life_expectancy = mean(life_expectancy),
mean_fertility = mean(fertility)) %>%
ggplot(aes(x = mean_fertility,
y = mean_life_expectancy)) +
geom_point(aes(colour = continent), alpha = 0.7) +
geom_label_repel(aes(label=year), size = 2.5, box.padding = .5)
Plot
labels_plot
7. Displaying a model / relationship that represents the data (and not showing the actual data itself)
## Model: linear regression of life expectancy on fertility
## (stored as lm_fit so the object does not mask the lm() function)
lm_fit <- lm(formula = life_expectancy ~ fertility, data = gapminder)
correlation <- cor.test(x = gapminder$fertility,
                        y = gapminder$life_expectancy,
                        method = "pearson")
# save predictions of the model in the new data frame
# together with variable you want to plot against
predicted_df <- data.frame(gapminder_pred = predict(lm_fit, newdata = gapminder),
                           fertility = gapminder$fertility)
Add model to plot
model_plot <- gapminder %>%
ggplot(aes(x = fertility,
y = life_expectancy)) +
# geom_point(alpha = 0.03) +
geom_line(data = predicted_df, aes(x = fertility,
y = gapminder_pred),
colour = "darkred", size = 1)
Plot
model_plot
Plotting statistics to the graph with the {ggpubr} package
Using a smoother geom_smooth to display potential relationships
## stat_cor() comes from {ggpubr}, so that package needs to be loaded first
library(ggpubr)
gapminder %>%
ggplot(aes(x = fertility,
y = life_expectancy)) +
geom_point(alpha = 0.02) +
geom_smooth(method = "lm") +
stat_cor(method = "pearson", label.x = 2, label.y = 30) +
theme_bw()
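The numbers that stat_cor() prints on the plot can also be read from a cor.test() result directly. The following is a minimal sketch that recomputes the test so it runs on its own; it assumes {dslabs} is installed.

```r
library(dslabs)

data("gapminder", package = "dslabs")

# Pearson correlation between fertility and life expectancy;
# cor.test() only uses the complete pairs of observations
correlation <- cor.test(x = gapminder$fertility,
                        y = gapminder$life_expectancy,
                        method = "pearson")

correlation$estimate   # correlation coefficient r
correlation$conf.int   # confidence interval for r
correlation$p.value    # p-value for the null hypothesis r = 0
```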
Recap: Discuss with your neighbour
Which tricks can we use to reduce the dimensionality of the plotted data (prevent overplotting)?
Try listing at least 6 methods:
Relation between gdp (Gross Domestic Product) and infant_mortality rate
Wikipedia (https://en.wikipedia.org/wiki/Gross_domestic_product): Gross Domestic Product (GDP) is a monetary measure of the market value of all the final goods and services produced in a period of time, often annually or quarterly. Nominal GDP estimates are commonly used to determine the economic performance of a whole country or region, and to make international comparisons.
gdp_infant_plot <- gapminder %>%
dplyr::filter(continent == "Europe" | continent == "Africa") %>%
ggplot(aes(x = gdp,
y = infant_mortality)) +
geom_point()
Plot
gdp_infant_plot
Adding a bit of colour
The figure above does not provide any clue on a possible difference between Europe and Africa, nor does it convey any information on trends over time.
colour_to_continent <- gapminder %>%
dplyr::filter(continent == "Europe" | continent == "Africa") %>%
ggplot(aes(x = gdp,
y = infant_mortality)) +
geom_point(aes(colour = continent))
Plot
colour_to_continent
Adding facets
Let’s investigate whether things have improved over time. We compare
1960 to 2010 by using a panel of two figures. Simply adding
facet_wrap( ~ facetting_variable) will do the trick.
Discuss with your neighbour:
Without looking ahead, try to construct a plot that conveys information
on the gdp per continent, over time. Try to recycle some of
the examples above.
facets_gdp_infant <- gapminder %>%
dplyr::filter(continent == "Europe" | continent == "Africa",
year == "1960" | year == "2010") %>%
ggplot(aes(x = gdp,
y = infant_mortality)) +
geom_point(aes(colour = continent)) +
facet_wrap(~ year) +
theme(axis.text.x = element_text(angle = -90, hjust = 1))
Plot
facets_gdp_infant
Mapping to continuous variables
So far we have been mapping colours and shapes to categorical variables. You can also map to continuous variables though.
continuous <- gapminder %>%
dplyr::filter(country == "Netherlands" |
country == "China" |
country == "India") %>%
dplyr::filter(year %in% years) %>%
ggplot(aes(x = year,
y = life_expectancy)) +
geom_point(aes(size = population, colour = country)) +
guides(colour = guide_legend(override.aes = list(alpha = 1))) +
geom_line(aes(group = country)) +
theme_bw()
Plot
continuous
Discuss with your neighbour
Try plotting infant_mortality against the filtered
years for the same countries as in the code above (Netherlands, India,
China), recycling some of that code. Discuss the resulting graph in
the light of the life_expectancy graph. What do you think about the
developments in China?
Want to know more? See: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4331212/ (Babiarz, 2016)
Discuss with your neighbour
Analyze the following code chunk: try running line by line to see what happens:
- How many observations are we plotting here?
- How many variables are we plotting?
- Try adding or removing variables to the group_by() statement. What happens if you do?
Summarize per continent and sum population
population_plot <- gapminder %>%
dplyr::group_by(continent, year) %>%
dplyr::filter(year %in% years) %>%
summarise(sum_population = sum(population)) %>%
ggplot(aes(x = year,
y = sum_population)) +
geom_point(aes(colour = continent)) +
geom_line(aes(group = continent,
colour = continent))
Plot
population_plot
Ranking data
ranking_plot <- gapminder %>%
dplyr::filter(continent == "Europe",
year == 2010) %>%
ggplot(aes(x = reorder(as_factor(country), population),
y = log10(population))) +
geom_point() +
ylab("log10(Population)") +
xlab("Country") +
coord_flip() +
geom_point(data = filter(gapminder %>%
dplyr::filter(continent == "Europe",
year == 2010), population >= 1e7), colour = "red")
Plot
ranking_plot
Let’s look at a time series
We filter for “Americas” and “Oceania” and look at
life_expectancy over the years.
## without summarizing for countries
gapminder$continent %>% as_factor() %>% levels()
## [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
gapminder %>%
dplyr::filter(continent == "Americas" | continent == "Oceania") %>%
ggplot(aes(x = year,
y = life_expectancy)) +
geom_line(aes(group = continent,
colour = continent))
Obviously something went wrong here. Please, discuss with your neighbour what you think happened or needs to be done to fix this (without looking ahead ;-) )
Grouping
We can see what happened if we plot individual datapoints
gapminder %>%
dplyr::filter(continent == "Americas" | continent == "Oceania") %>%
ggplot(aes(x = year,
y = life_expectancy)) +
geom_point(aes(colour = country)) +
facet_wrap( ~ continent) +
theme(legend.position = "none")
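The points show that every continent contains many countries, so a line grouped by continent has to connect all of them within each year. One possible fix, sketched here as an illustration, is to group the line by country and keep colour for the continent. The name `country_lines` and the alpha value are choices made for this sketch.

```r
library(dplyr)
library(ggplot2)
library(dslabs)

data("gapminder", package = "dslabs")

# group = country tells geom_line() which observations belong to
# the same line; colour still distinguishes the two continents
country_lines <- gapminder %>%
  filter(continent %in% c("Americas", "Oceania")) %>%
  ggplot(aes(x = year,
             y = life_expectancy)) +
  geom_line(aes(group = country, colour = continent), alpha = 0.5)

country_lines
```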
Summarizing time series data
gapminder$continent %>% as_factor() %>% levels()
## [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
gapminder %>%
dplyr::filter(continent == "Americas" | continent == "Oceania") %>%
group_by(continent, year) %>%
summarise(mean_life_expectancy = mean(life_expectancy)) %>%
ggplot(aes(x = year,
y = mean_life_expectancy)) +
geom_line(aes(group = continent,
colour = continent)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Statistical proof?
df <- gapminder %>%
dplyr::filter(continent == "Americas" | continent == "Oceania") %>%
group_by(continent, year)
model <- aov(data = df, life_expectancy ~ continent * year)
anova(model)
## Analysis of Variance Table
##
## Response: life_expectancy
## Df Sum Sq Mean Sq F value Pr(>F)
## continent 1 8982 8982 269.104 <2e-16 ***
## year 1 58606 58606 1755.931 <2e-16 ***
## continent:year 1 9 9 0.278 0.5981
## Residuals 2732 91183 33
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Some remarks on the above Two-way ANOVA:
- Repeated measures / multilevel models might be more appropriate here (paired / nested designs)
- We did not perform any check on assumptions
- We performed our analysis on only part of the data
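As a rough illustration of the second remark, the residuals of the fitted model can be inspected with base R graphics. This sketch refits the same model so that it runs on its own; it is a quick visual check under the assumption that {dslabs} and {dplyr} are installed, not a full validation of the analysis.

```r
library(dplyr)
library(dslabs)

data("gapminder", package = "dslabs")

df <- gapminder %>%
  filter(continent %in% c("Americas", "Oceania")) %>%
  droplevels()

model <- aov(life_expectancy ~ continent * year, data = df)

# Residuals against fitted values: a funnel shape or a curve
# suggests unequal variances or a non-linear relation
plot(fitted(model), residuals(model),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot of the residuals: points far from the line
# suggest that the residuals are not normally distributed
qqnorm(residuals(model))
qqline(residuals(model))
```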
One more option: categorical values and “jitter”
Sometimes points overlap so heavily that adding transparency with
alpha or mapping colour to an underlying categorical variable
does not help, because there are simply too many points overlapping
Let’s look at an example
gapminder %>%
dplyr::filter(continent == "Americas" |
continent == "Africa") %>%
group_by(continent) %>%
dplyr::filter(year %in% years) %>%
ggplot(aes(x = year,
y = infant_mortality)) +
geom_point(aes(colour = country)) +
theme(legend.position="none")
In such cases it can be helpful to add some noise to the points
(position = "jitter") to reduce overlapping. This can be a
powerful approach, especially when combined with setting
alpha
gapminder %>%
dplyr::filter(continent == "Americas" |
continent == "Africa") %>%
dplyr::filter(year %in% years) %>%
group_by(continent) %>%
ggplot(aes(x = year,
y = infant_mortality)) +
geom_point(aes(colour = continent), position = "jitter")
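Jitter and transparency can be combined in one layer. The sketch below uses geom_jitter(), which is shorthand for geom_point(position = "jitter"); the width, height and alpha values are illustrative choices, not settings taken from the course.

```r
library(dplyr)
library(ggplot2)
library(dslabs)

data("gapminder", package = "dslabs")

years <- seq(1960, 2010, by = 10)

jitter_alpha_plot <- gapminder %>%
  filter(continent %in% c("Americas", "Africa"),
         year %in% years) %>%
  ggplot(aes(x = year,
             y = infant_mortality)) +
  # width adds horizontal noise (in years); height = 0 leaves the
  # infant mortality values themselves unchanged
  geom_jitter(aes(colour = continent),
              width = 2, height = 0, alpha = 0.4)

jitter_alpha_plot
```

Setting height = 0 keeps the y-values exact, so only the position along the year axis is randomized.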
Bar chart
It would be nice to know what the mean infant mortality is for both continents
gapminder %>%
dplyr::filter(continent == "Americas" |
continent == "Africa") %>%
dplyr::filter(year %in% years) %>%
group_by(continent, year) %>%
summarise(mean_infant_mortality = mean(infant_mortality, na.rm = TRUE)) %>%
ggplot(aes(x = year,
y = mean_infant_mortality)) +
geom_col(aes(fill = continent), position = "dodge")
Adding summary data to an existing plot
Now that we have the mean infant mortality for each year for the two continents, let’s add that data to the previous dot plot where we used jitter
mean_inf_mort <- gapminder %>%
dplyr::filter(continent == "Americas" |
continent == "Africa") %>%
dplyr::filter(year %in% years) %>%
group_by(continent, year) %>%
summarise(mean_infant_mortality = mean(infant_mortality, na.rm = TRUE))
gapminder %>%
dplyr::filter(continent == "Americas" |
continent == "Africa") %>%
dplyr::filter(year %in% years) %>%
group_by(continent) %>%
ggplot(aes(x = year,
y = infant_mortality)) +
geom_point(aes(colour = continent), position = "jitter") +
## summary data added to previous
geom_line(data = mean_inf_mort, aes(x = year,
y = mean_infant_mortality,
colour = continent), size = 2)
Filter data from a graph
In the figure above we can observe a number of countries in the ‘Americas’ continent with an infant mortality above the average (over the years) of ‘Africa’. Which countries are these?
library(ggiraph)
gapminder$country <-
str_replace_all(string = gapminder$country,
pattern = "'",
replacement = "_")
interactive_inf_mort <- gapminder %>%
dplyr::filter(continent == "Americas" |
continent == "Africa") %>%
dplyr::filter(year %in% years) %>%
group_by(region, country) %>%
ggplot(aes(x = year,
y = infant_mortality)) +
geom_point_interactive(aes(tooltip = country, colour = region), position = "jitter") +
# geom_point(aes(colour = continent), position = "jitter") +
## summary data added to previous
geom_line(data = mean_inf_mort, aes(x = year,
y = mean_infant_mortality,
colour = continent, group = continent), size = 2
)
interactive_inf_mort
gapminder$country %>% as_factor() %>% levels()
## [1] "Albania" "Algeria"
## [3] "Angola" "Antigua and Barbuda"
## [5] "Argentina" "Armenia"
## [7] "Aruba" "Australia"
## [9] "Austria" "Azerbaijan"
## [11] "Bahamas" "Bahrain"
## [13] "Bangladesh" "Barbados"
## [15] "Belarus" "Belgium"
## [17] "Belize" "Benin"
## [19] "Bhutan" "Bolivia"
## [21] "Bosnia and Herzegovina" "Botswana"
## [23] "Brazil" "Brunei"
## [25] "Bulgaria" "Burkina Faso"
## [27] "Burundi" "Cambodia"
## [29] "Cameroon" "Canada"
## [31] "Cape Verde" "Central African Republic"
## [33] "Chad" "Chile"
## [35] "China" "Colombia"
## [37] "Comoros" "Congo, Dem. Rep."
## [39] "Congo, Rep." "Costa Rica"
## [41] "Cote d_Ivoire" "Croatia"
## [43] "Cuba" "Cyprus"
## [45] "Czech Republic" "Denmark"
## [47] "Djibouti" "Dominican Republic"
## [49] "Ecuador" "Egypt"
## [51] "El Salvador" "Equatorial Guinea"
## [53] "Eritrea" "Estonia"
## [55] "Ethiopia" "Fiji"
## [57] "Finland" "France"
## [59] "French Polynesia" "Gabon"
## [61] "Gambia" "Georgia"
## [63] "Germany" "Ghana"
## [65] "Greece" "Greenland"
## [67] "Grenada" "Guatemala"
## [69] "Guinea" "Guinea-Bissau"
## [71] "Guyana" "Haiti"
## [73] "Honduras" "Hong Kong, China"
## [75] "Hungary" "Iceland"
## [77] "India" "Indonesia"
## [79] "Iran" "Iraq"
## [81] "Ireland" "Israel"
## [83] "Italy" "Jamaica"
## [85] "Japan" "Jordan"
## [87] "Kazakhstan" "Kenya"
## [89] "Kiribati" "South Korea"
## [91] "Kuwait" "Kyrgyz Republic"
## [93] "Lao" "Latvia"
## [95] "Lebanon" "Lesotho"
## [97] "Liberia" "Libya"
## [99] "Lithuania" "Luxembourg"
## [101] "Macao, China" "Macedonia, FYR"
## [103] "Madagascar" "Malawi"
## [105] "Malaysia" "Maldives"
## [107] "Mali" "Malta"
## [109] "Mauritania" "Mauritius"
## [111] "Mexico" "Micronesia, Fed. Sts."
## [113] "Moldova" "Mongolia"
## [115] "Montenegro" "Morocco"
## [117] "Mozambique" "Namibia"
## [119] "Nepal" "Netherlands"
## [121] "New Caledonia" "New Zealand"
## [123] "Nicaragua" "Niger"
## [125] "Nigeria" "Norway"
## [127] "Oman" "Pakistan"
## [129] "Panama" "Papua New Guinea"
## [131] "Paraguay" "Peru"
## [133] "Philippines" "Poland"
## [135] "Portugal" "Puerto Rico"
## [137] "Qatar" "Romania"
## [139] "Russia" "Rwanda"
## [141] "St. Lucia" "St. Vincent and the Grenadines"
## [143] "Samoa" "Saudi Arabia"
## [145] "Senegal" "Serbia"
## [147] "Seychelles" "Sierra Leone"
## [149] "Singapore" "Slovak Republic"
## [151] "Slovenia" "Solomon Islands"
## [153] "South Africa" "Spain"
## [155] "Sri Lanka" "Sudan"
## [157] "Suriname" "Swaziland"
## [159] "Sweden" "Switzerland"
## [161] "Syria" "Tajikistan"
## [163] "Tanzania" "Thailand"
## [165] "Timor-Leste" "Togo"
## [167] "Tonga" "Trinidad and Tobago"
## [169] "Tunisia" "Turkey"
## [171] "Turkmenistan" "Uganda"
## [173] "Ukraine" "United Arab Emirates"
## [175] "United Kingdom" "United States"
## [177] "Uruguay" "Uzbekistan"
## [179] "Vanuatu" "Venezuela"
## [181] "West Bank and Gaza" "Vietnam"
## [183] "Yemen" "Zambia"
## [185] "Zimbabwe"
## girafe() replaces the deprecated ggiraph() function in newer versions of {ggiraph}
girafe(ggobj = interactive_inf_mort)
A more complicated example (for showing the capabilities of ggplot2)
west <- c("Western Europe","Northern Europe","Southern Europe",
"Northern America","Australia and New Zealand")
gapminder <- gapminder %>%
mutate(group = case_when(
region %in% west ~ "The West",
region %in% c("Eastern Asia", "South-Eastern Asia") ~ "East Asia",
region %in% c("Caribbean", "Central America", "South America") ~ "Latin America",
continent == "Africa" & region != "Northern Africa" ~ "Sub-Saharan Africa",
TRUE ~ "Others"))
gapminder <- gapminder %>%
mutate(group = factor(group, levels = rev(c("Others", "Latin America", "East Asia","Sub-Saharan Africa", "The West"))))
filter(gapminder, year%in%c(1962, 2013) & !is.na(group) &
!is.na(fertility) & !is.na(life_expectancy)) %>%
mutate(population_in_millions = population/10^6) %>%
ggplot( aes(fertility, y=life_expectancy, col = group, size = population_in_millions)) +
geom_point(alpha = 0.8) +
guides(size = "none") +
theme(plot.title = element_blank(), legend.title = element_blank()) +
coord_cartesian(ylim = c(30, 85)) +
xlab("Fertility rate (births per woman)") +
ylab("Life Expectancy") +
geom_text(aes(x=7, y=82, label=year), cex=12, color="grey") +
facet_grid(. ~ year) +
theme(strip.background = element_blank(),
strip.text.x = element_blank(),
strip.text.y = element_blank(),
legend.position = "top")
Optional (Data Distributions & Outliers)
Detecting outliers
For this part we use a different and simpler dataset. This dataset contains 1192 observations on self-reported:
- height (inch)
- earn ($)
- sex (gender)
- ed (currently unannotated)
- age (years)
- race
heights_data <- read_csv(file = file.path(root,
"data",
"extra",
"heights_outliers.csv"))
heights_data
## # A tibble: 1,192 × 6
## earn height sex ed age race
## <dbl> <dbl> <chr> <dbl> <dbl> <chr>
## 1 50000 74.4 male 16 45 white
## 2 60000 65.5 female 16 58 white
## 3 30000 63.6 female 16 29 white
## 4 50000 63.1 female 16 91 other
## 5 51000 63.4 female 17 39 white
## 6 9000 64.4 female 15 26 white
## 7 29000 61.7 female 12 49 white
## 8 32000 72.7 male 17 46 white
## 9 2000 72.0 male 15 21 hispanic
## 10 27000 72.2 male 12 26 white
## # … with 1,182 more rows
Data characteristics
We will focus on the variable height here
summary_heights_data <- heights_data %>%
group_by(sex, age) %>%
summarise(mean_height = mean(height, na.rm = TRUE),
min_height = min(height),
max_height = max(height)) %>%
arrange(desc(mean_height))
summary_heights_data[c(1:4),]
## # A tibble: 4 × 5
## # Groups: sex [2]
## sex age mean_height min_height max_height
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 female 55 141. 61.9 664.
## 2 male 39 134. 66.6 572.
## 3 male 55 73.2 71.7 74.8
## 4 male 91 73.1 73.1 73.1
From the above summary we can conclude that there are two outliers (presumably entry errors).
Calculate the height in meters for each outlier in the
Console (1 inch = 0.0254 meters).
Please discuss the solution with your neighbour
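One way to set up this calculation is a small helper function. The name `inch_to_meter` is hypothetical, and the input values are the rounded maxima from the summary table above.

```r
# Conversion factor from the text: 1 inch = 0.0254 meters
inch_to_meter <- function(inch) {
  inch * 0.0254
}

# Rounded maximum heights (in inches) of the two suspicious groups
inch_to_meter(c(664, 572))
```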
Checking the frequency distribution
heights_data %>%
ggplot(aes(x = height)) +
geom_histogram(bins = 200)
This distribution looks odd. When you see a large x-axis with no data plotted on it, it usually means there is an outlier. If you look carefully, you will spot two outliers around 600
Boxplots to detect outliers
heights_data %>%
ggplot(aes(y = height)) +
geom_boxplot()
So apparently there are two data points that are way off from the rest
of the distribution. Let’s remove these points, using
filter() from the {dplyr} package like we did
before on the gapminder dataset.
heights_data %>%
dplyr::filter(height < 100) %>%
ggplot(aes(y = height)) +
geom_boxplot()
## by sex
heights_data %>%
dplyr::filter(height < 100) %>%
ggplot(aes(y = height, x = sex)) +
geom_boxplot()
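By default, geom_boxplot() draws points as outliers when they lie more than 1.5 times the interquartile range beyond the box. The same rule can be applied numerically, as in the sketch below. It uses simulated heights rather than the course data, and all object names are illustrative.

```r
set.seed(123)

# Simulated heights in inches plus two artificial entry errors
sim_height <- c(rnorm(200, mean = 67, sd = 4), 572, 664)

# boxplot.stats() returns the values beyond 1.5 * IQR from the hinges
flagged <- boxplot.stats(sim_height)$out
flagged

# The same rule written out with quartiles; quartiles and hinges can
# differ slightly, so borderline points may be classified differently
q <- quantile(sim_height, probs = c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]
by_hand <- sim_height[sim_height < q[[1]] - 1.5 * iqr |
                      sim_height > q[[2]] + 1.5 * iqr]
by_hand
```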
New frequency distribution
Now let’s plot a new distribution plot. This time we plot density, leaving the outliers out
heights_data %>%
dplyr::filter(height < 100) %>%
ggplot(aes(height)) +
geom_freqpoly(aes(y = after_stat(density)))
## by sex
heights_data %>%
dplyr::filter(height < 100) %>%
ggplot(aes(height)) +
geom_freqpoly(aes(y = after_stat(density), colour = sex))
Checking normality with a qqplot
## a qqplot provides a visual aid to assess whether a distribution is approaching normality
source(file = file.path(root, "R", "ggqq.R"))
height_data_outlier_removed <- heights_data %>%
dplyr::filter(height < 100)
gg_qq(height_data_outlier_removed$height)
## 25% 75%
## 66.926998 4.328462
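The gg_qq() function above comes from a local script in the course repository. If that script is not available, a comparable plot can be sketched with the stat_qq() and stat_qq_line() layers that ship with {ggplot2}. The example uses simulated heights, not the course data.

```r
library(ggplot2)

set.seed(123)

# Simulated heights in inches, standing in for the course data
sim_heights <- data.frame(height = rnorm(500, mean = 67, sd = 4))

qq_plot <- ggplot(sim_heights, aes(sample = height)) +
  # sample quantiles against theoretical normal quantiles
  stat_qq(alpha = 0.4) +
  # reference line through the first and third quartiles
  stat_qq_line(colour = "darkred")

qq_plot
```

Points that follow the line closely suggest an approximately normal distribution; systematic bends at the ends point to heavy or light tails.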
## formal statistical test
shapiro.test(height_data_outlier_removed$height)
##
## Shapiro-Wilk normality test
##
## data: height_data_outlier_removed$height
## W = 0.98485, p-value = 8.491e-10
For all data combined, we reject the hypothesis that the sample has a normal distribution
Test individual distributions
males <- height_data_outlier_removed %>%
dplyr::filter(sex == "male")
females <- height_data_outlier_removed %>%
dplyr::filter(sex == "female")
shapiro.test(males$height)
##
## Shapiro-Wilk normality test
##
## data: males$height
## W = 0.99053, p-value = 0.002532
shapiro.test(females$height)
##
## Shapiro-Wilk normality test
##
## data: females$height
## W = 0.99277, p-value = 0.002105
## add shapiro for each sex
## we can do the same for age
shapiro.test(males$age)
##
## Shapiro-Wilk normality test
##
## data: males$age
## W = 0.93358, p-value = 3.506e-14
shapiro.test(females$age)
##
## Shapiro-Wilk normality test
##
## data: females$age
## W = 0.93978, p-value = 4.862e-16
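Instead of creating a separate data frame for each sex, the test can also be run per group in a single pipeline. This is a sketch on simulated data; the tibble `sim_heights` and its values are made up for illustration and are not the course data.

```r
library(dplyr)

set.seed(123)

# Simulated heights in inches for two groups
sim_heights <- tibble(
  sex    = rep(c("female", "male"), each = 300),
  height = c(rnorm(300, mean = 64, sd = 3),
             rnorm(300, mean = 70, sd = 3))
)

# One Shapiro-Wilk test per group, collected in a small table
shapiro_by_sex <- sim_heights %>%
  group_by(sex) %>%
  summarise(W       = shapiro.test(height)$statistic,
            p_value = shapiro.test(height)$p.value)

shapiro_by_sex
```

Adding more variables to group_by(), for example an age class, extends the same table to more groups.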